Dynamic application autotuning for self-aware approximate computing
Energy consumption limits application performance in a wide range of scenarios, from embedded systems to High-Performance Computing. To improve computation efficiency, this chapter focuses on a software-level methodology that enhances a target application with an adaptive layer providing self-optimization capabilities. We evaluated the benefits of dynamic autotuning in three case studies: a probabilistic time-dependent routing application from a navigation system, a molecular docking application for virtual screening, and a stereo-matching application that computes the depth of a three-dimensional scene. Experimental results show how computation efficiency can be improved by adapting both reactively and proactively.
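As an illustrative sketch of the reactive side of such an adaptive layer (not the chapter's actual interface; names, thresholds, and knob semantics below are hypothetical), a policy might raise an approximation knob when the measured latency misses its target and lower it when there is headroom:

```python
def adapt_reactively(observed_latency_ms, knob, target_ms=50, step=1,
                     knob_range=(1, 10)):
    """Toy reactive policy: trade accuracy for speed by raising an
    approximation knob when latency exceeds the target, and reclaim
    accuracy when latency is comfortably below it."""
    lo, hi = knob_range
    if observed_latency_ms > target_ms:
        knob = min(hi, knob + step)   # approximate more to run faster
    elif observed_latency_ms < 0.8 * target_ms:
        knob = max(lo, knob - step)   # headroom available: approximate less
    return knob
```

A proactive policy would instead predict the cost of the next input (as the PTDR case study below does) and set the knob before running.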
Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications
Due to the increasing size of HPC machines, faults are becoming an
eventuality that applications must face. Natively, MPI provides no support
for execution past the detection of a fault, and this is becoming more and
more constraining. The User Level Fault Mitigation library (ULFM) introduced
a possible way to survive a fault during the application execution, at the
cost of code modifications. ULFM is intrusive in the application and also
requires a deep understanding of its recovery procedures.
In this paper we propose Legio, a framework that lowers the complexity of
introducing resiliency in an embarrassingly parallel MPI application. By
hiding ULFM behind the MPI calls, the library exposes resiliency features to
the application in a transparent manner, thus removing any integration
effort. Upon a fault, the failed nodes are discarded and the execution
continues only with the non-failed ones. We also propose a hierarchical
implementation of the solution to reduce the overhead of the repair process
when scaling towards a large number of nodes.
We evaluated our solutions on the Marconi100 cluster at CINECA, showing that
the overhead introduced by the library is negligible and does not limit the
scalability properties of MPI. Moreover, we also integrated the solution
into real-world applications to further prove its robustness by injecting faults.
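The discard-and-continue semantics can be sketched with a toy sequential model (this is not Legio's MPI/ULFM implementation; workers, tasks, and the failure model are all made up for illustration):

```python
import random

def run_resilient(tasks, workers, fail_prob=0.2, seed=0):
    """Toy model of discard-and-continue resiliency: when a worker fails,
    it is discarded (its task is lost, not repaired) and the surviving
    workers keep processing the remaining embarrassingly parallel work."""
    rng = random.Random(seed)
    alive = list(workers)
    results = {}
    for task in tasks:
        if not alive:
            break                      # every node failed: nothing left to run
        worker = alive[task % len(alive)]
        if rng.random() < fail_prob:   # simulated node failure
            alive.remove(worker)       # discard the failed node
            continue
        results[task] = (worker, task * task)  # stand-in for the real work
    return results, alive
```

The real library does this underneath the standard MPI calls, so the application code never sees the shrunken communicator.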
An Efficient Monte Carlo-based Probabilistic Time-Dependent Routing Calculation Targeting a Server-Side Car Navigation System
Incorporating speed probability distributions into the computation of route
planning in car navigation systems guarantees more accurate and precise
responses. In this paper, we propose a novel approach for dynamically
selecting the number of samples used in the Monte Carlo simulation that
solves the Probabilistic Time-Dependent Routing (PTDR) problem, thus
improving the computation efficiency. The proposed method proactively
determines the number of simulations needed to extract the travel-time
estimation for each specific request while respecting an error threshold as
the output quality level. The methodology requires a reduced effort on the
application development side. We adopted an aspect-oriented programming
language (LARA) together with a flexible dynamic autotuning library (mARGOt)
to instrument the code and to take tuning decisions on the number of
samples, respectively, improving the execution efficiency. Experimental
results demonstrate that the proposed adaptive approach saves a large
fraction of simulations (between 36% and 81%) with respect to a static
approach, considering different traffic situations, paths, and error
requirements. Given the negligible runtime overhead of the proposed
approach, it results in an execution-time speedup between 1.5x and 5.1x.
This speedup is reflected at the infrastructure level as a reduction of
around 36% in the computing resources needed to support the whole navigation pipeline.
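The core idea of sizing the Monte Carlo run per request can be sketched as follows (a simplified stand-in for the PTDR method; the chunk size, threshold, and log-normal travel-time model are assumptions, not the paper's):

```python
import random
import statistics

def estimate_travel_time(sample_fn, err_threshold=0.02, chunk=100,
                         max_samples=20000, seed=42):
    """Draw travel-time samples in chunks until the relative standard error
    of the mean falls below err_threshold, instead of always running a
    static, worst-case number of simulations."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < max_samples:
        samples.extend(sample_fn(rng) for _ in range(chunk))
        mean = statistics.fmean(samples)
        sem = statistics.stdev(samples) / len(samples) ** 0.5
        if sem / mean < err_threshold:
            break
    return mean, len(samples)

# Example with log-normally distributed travel times (an assumption):
mean, used = estimate_travel_time(lambda rng: rng.lognormvariate(3.0, 0.5))
```

Easy requests (low travel-time variance) stop after few chunks, which is where the reported 36%-81% savings over a static sample count come from.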
Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs
Virtual screening is an early stage in the drug discovery process that selects the most promising candidates. In the urgent computing scenario, finding a solution in the shortest time frame is critical. Any improvement in the performance of a virtual screening application translates into an increase in the number of candidates evaluated, thereby raising the probability of finding a drug. In this paper, we show how we can improve application throughput using out-of-kernel optimizations. They use input features, kernel requirements, and architectural features to rearrange the kernel inputs, executing them out of order, to improve the computation efficiency. These optimizations are implemented in an extreme-scale virtual screening application, named LiGen, which can rely on CUDA and SYCL kernels to carry out the computation on modern supercomputer nodes. Even if they are tailored to a single application, they might also be of interest for applications that share a similar design pattern. The experimental results show how these optimizations can increase kernel performance by about 2x: up to 2.2x in CUDA and up to 1.9x in SYCL. Moreover, the reported speedup can be achieved with the best proposed parameterization, as shown by the data collected and reported in this manuscript.
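A minimal sketch of the input-rearrangement idea (LiGen's actual heuristics and data layout are more involved; the `atoms` field is a hypothetical size-like feature, not LiGen's schema):

```python
def batch_out_of_order(ligands, bucket_size=32):
    """Rearrange kernel inputs out of arrival order: sort by a size-like
    feature so each batch groups similarly sized inputs, reducing padding
    and divergence when a batch is processed by a single kernel launch."""
    ordered = sorted(ligands, key=lambda lig: lig["atoms"])
    return [ordered[i:i + bucket_size]
            for i in range(0, len(ordered), bucket_size)]
```

Because every input in a bucket now has a similar cost, the kernel can be sized for the bucket instead of the worst case in the whole database.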
GPU-optimized approaches to molecular docking-based virtual screening in drug discovery: A comparative analysis
Finding a novel drug is a very long and complex procedure. Using computer simulations, it is possible to accelerate the preliminary phases by performing a virtual screening that filters a large set of drug candidates down to a manageable number. This paper presents and comparatively analyzes two GPU-optimized implementations of a virtual screening algorithm targeting novel GPU architectures. This work focuses on the analysis of parallel computation patterns and their mapping onto the target architecture. The first method adopts a traditional approach that spreads the computation for a single molecule across the entire GPU. The second uses a novel batched approach that exploits the parallel architecture of the GPU to evaluate more molecules in parallel. Experimental results show a different behavior depending on the size of the database to be screened: the implementations either reach a performance plateau sooner or have a more extended initial transient period before achieving a higher throughput (up to 5x), the latter being more suitable for extreme-scale virtual screening campaigns.
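The gap between the two mappings can be illustrated with a toy cost model counting "waves" of work on a GPU with `num_sm` streaming multiprocessors (the numbers are illustrative, not measurements from the paper):

```python
import math

def waves_single(num_sm, poses_per_molecule, molecules):
    """Traditional mapping: one molecule at a time, its poses spread across
    the whole GPU; SMs idle whenever poses_per_molecule < num_sm."""
    return molecules * math.ceil(poses_per_molecule / num_sm)

def waves_batched(num_sm, poses_per_molecule, molecules):
    """Batched mapping: one molecule per SM, so up to num_sm molecules are
    evaluated in parallel and the device stays full for large databases."""
    return math.ceil(molecules / num_sm) * poses_per_molecule
```

With 80 SMs, 8 poses per molecule, and 1000 molecules, the traditional mapping needs 1000 waves while the batched one needs 104, consistent with the batched approach paying off mainly on large databases.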
Evaluating Orthogonality between Application Auto-Tuning and Run-Time Resource Management for Adaptive OpenCL Applications
The ever-increasing number of processing units integrated on the same many-core chip delivers computational power that can exceed the performance requirements of a single application. The number of chips (and the related power consumption) can thus be reduced by serving multiple applications, a practice called resource consolidation. However, this solution requires techniques to partition and assign resources among the applications and to manage unpredictable dynamic workloads. To meet the performance requirements in such scenarios, we exploit application auto-tuning, based on design-time analysis, of both application-specific dynamic knobs and computational parallelism. These features are implemented in a software library, which is used to demonstrate the main contribution of this paper: a lightweight Run-Time Resource Management (RTRM) technique to improve resource sharing for computationally intensive OpenCL applications. We evaluate how the interaction between RTRM and application auto-tuning can become synergistic yet orthogonal. In the proposed approach, run-time adaptation decisions are taken autonomously by each application. This has two main advantages: i) a non-invasive application design, in terms of source code, and ii) a very low run-time overhead, since it requires neither central coordination by a supervisor nor communication between the applications. We carried out an experimental campaign using a video processing application, an OpenCL stereo-matching implementation, while stressing resource usage. We show that, while RTRM is necessary to provide lower variance of the application performance, the application auto-tuning layer is fundamental to trade it off against computation accuracy.
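The per-application decision can be sketched as a lookup over a design-time profile (a simplified illustration, not the library's actual API; the Pareto points and field names below are made up):

```python
def pick_config(pareto, assigned_cores, target_fps):
    """Choose the most accurate design-time configuration whose predicted
    throughput on the cores assigned by the resource manager still meets
    target_fps; if none fits, fall back to the fastest configuration,
    trading accuracy for throughput."""
    def throughput(cfg):
        return min(cfg["parallelism"], assigned_cores) * cfg["fps_per_core"]
    feasible = [c for c in pareto if throughput(c) >= target_fps]
    if not feasible:
        return max(pareto, key=throughput)
    return max(feasible, key=lambda c: c["accuracy"])
```

Since each application only reads the resources it was assigned, no supervisor coordination or inter-application communication is needed, which is what keeps the two layers orthogonal.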
An extreme-scale virtual screening platform for drug discovery
Virtual screening is one of the early stages of drug discovery that aims to
select a set of promising ligands from a vast chemical library. Molecular
docking is a crucial task in this process, and it consists of estimating
the position of a molecule inside the docking site. In the context of
urgent computing, we designed from scratch the EXSCALATE molecular docking
platform to benefit from heterogeneous computation nodes and to avoid
scaling issues.